Bootstrapping Toponym Classifiers

Authors

  • David A. Smith
  • Gideon S. Mann
Abstract

We present minimally supervised methods for training and testing geographic name disambiguation (GND) systems. We train data-driven place name classifiers using toponyms already disambiguated in the training text by such existing cues as “Nashville, Tenn.” or “Springfield, MA”, and test the system on texts where these cues have been stripped out and on hand-tagged historical texts. We experiment on three English-language corpora of varying provenance and complexity: newsfeed from the 1990s, personal narratives from the 19th-century American West, and memoirs and records of the U.S. Civil War. Disambiguation accuracy ranges from 87% for news to 69% for some historical collections.

1 Scope and Prior Work

As in early work with such named-entity recognition systems as Nominator (Wacholder et al., 1997), much previous work in GND has relied on heuristic rules (Olligschlaeger and Hauptmann, 1999; Kanada, 1999) and on such culturally specific and knowledge-intensive techniques as postal codes, addresses, and telephone numbers (McCurley, 2001). In previous work, we used the heuristic technique of calculating weighted centroids of geographic focus in documents (Smith and Crane, 2001). Sites closer to the centroid were weighted more heavily than sites far away unless they had some countervailing importance, such as being a world capital.

News texts offer two principal advantages for bootstrapping geocoding applications.
Just as journalistic style prefers identifying persons by full name and title on first mention, place names, when not of major cities, are often first mentioned followed by the name of their state, province, or country. Even if a toponym is strictly unambiguous, it may still be labelled to provide the reader with some “backoff” recognition. Although there is only one place in the world named “Wye Mills”, an author would still usually append “Maryland” to it so that a reader who doesn’t recognize the place name can still situate it within a rough area. In any case, the goal is to generalize from the kinds of contexts in which writers use a disambiguating label to those in which they do not. Since news stories also tend to be relatively short and focused on a single topic, we can also exploit the heuristic of “one sense per discourse”: unless otherwise indicated (e.g., by a different state label), subsequent mentions of the toponym in the story can be identified with the first, unambiguous reference. News stories often also have disambiguated toponyms in their datelines.

Our news training corpus consists of two years (1989-90) of AP wire and two months (October and November 1998) of Topic Detection and Tracking (TDT) data. The test set is the December 1998 TDT data. See Table 1 for the numbers of toponyms in the corpora.

In contrast to news texts, historical documents exhibit a higher density of geographical reference and a higher level of ambiguity. To test the performance of our minimally supervised classifiers in a particularly challenging domain, we test them on a corpus of historical documents in which all place names have been marked and disambiguated. As with news texts, we initially train and test our classifiers on raw text. The range of geographic reference in these texts is somewhat similar to American news text: the corpus comprises the Personal Memoirs of Ulysses S. Grant and two nineteenth-century books of travel about California and Minnesota from the Library of Congress’ American Memory project (see footnote 1). In all, we thus have about 600 pages of tagged historical text.

2 Experimental Setup

Dividing the corpora into training and test data, we train Naive Bayes classifiers on all examples of disambiguated toponyms in the training set. Although it is not uncommon for two places in the same state, for example, to share a name, we define disambiguation for purposes of these experiments as finding the correct U.S. state or foreign country. This asymmetry is reflected in the U.S. news and historical texts used for training, where toponyms are specified by U.S. states or by foreign countries. We then run the classifiers on the test text with disambiguating labels, such as state or country names that immediately follow the city name, removed.

Since not all toponyms in the test set will have been seen in training, we also train backoff classifiers to guess the states and countries related to a story. If, for example, we cannot find a classifier for “Oxford”, but can tell that a story is about Mississippi, we will still be able to disambiguate.

We use a gazetteer to restrict the set of candidate states and countries for a given place name. In trying to disambiguate “Portland”, we would thus consider Oregon, Maine, and England, among other options, but not Maryland. As in the word sense disambiguation task as usually defined, we are classifying names and not clustering them. This approach is practical for geographic names, for which broad-coverage gazetteers exist, though less so for personal names (Mann and Yarowsky, 2003).

System performance is measured with reference to the naive baseline where each ambiguous toponym is guessed to be the most commonly occurring place. London, England, would thus always be guessed rather than London, Ontario.
Bootstrapping methods similar to ours have been shown to be competitive in word sense disambiguation (Yarowsky and Florian, 2003; Yarowsky, 1995).

3 Difficulty of the Task

Our ability to disambiguate place names should be weighed against the ease or difficulty of the task. In a world where most toponyms referred unambiguously to one place, we would not be impressed by near-perfect performance. Before considering how toponyms are used in text, we can examine the inherent ambiguity of place names in

Footnote 1: Our annotated data also includes disambiguated texts of Herodotus’ Histories and Caesar’s Gallic War, but toponyms in the ancient (especially Greek) world do not show enough ambiguity with personal names or with each other to be interesting.

Corpus    Train    Test    Tagged




Publication date: 2003